NetCDF save with dask support #2411

bjlittle · 2017-02-27T10:23:07Z

Support for streamed netCDF saving using dask.

Introduces the convenience function iris._lazy_data.convert_nans_array to deal with converting a NaN array to either a masked array or a filled ndarray.

bjlittle · 2017-03-08T11:59:00Z

@pp-mo I assume that I now just re-target this PR against dask branch?

bjlittle · 2017-03-09T11:16:40Z

lib/iris/_lazy_data.py

+            else:
+                array = ma.masked_array(array, mask=mask,
+                                        fill_value=fill_value)
+    return array


@pp-mo I've combined my array_nans_to_filled and array_nans_to_masked and opted to use the filled kwarg, rather than have separate but very similar functions.

Interested in your take on my raising an exception for the filled = True but fill_value = None case ... seems the right thing to do.

Also, this routine will perform dtype casting ... again, that seems to me like the best place to do it, but I'm not sure whether wrapping that capability up within array_nans_to_masked is appropriate for a higher level function that calls it, such as your as_concrete_data.

bjlittle · 2017-03-09T11:20:02Z

lib/iris/cube.py

+                self._numpy_array = array_nans_to_masked(data,
+                                                         self.fill_value,
+                                                         self.dtype)
+                self.dtype = None


@pp-mo I'm expecting you to have stomped over this with your PR, but hopefully they align in logic ...

pp-mo

Not much to argue over here, so check it out

pp-mo · 2017-03-09T11:25:49Z

lib/iris/_lazy_data.py

+        # Finally, mask or fill the data, as required.
+        if np.any(mask):
+            if filled:
+                if fill_value is None:


This doesn't actually work...

if filled: if fill_value is None: ...

I think it needs to be

if filled is not False: if filled is None:

pp-mo · 2017-03-09T11:29:27Z

lib/iris/_lazy_data.py

+        # First, calculate the mask.
+        mask = np.isnan(array)
+        # Now, cast the dtype, if required.
+        if dtype is not None and dtype != array.dtype:


I think for efficiency's sake (aargh!!) this is better done after filling -- we don't want to do a masked 'astype' when it isn't needed.
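
One way to read this suggestion: fill the NaN points on the plain ndarray first, so that the `astype` call only ever sees an unmasked array. The following is a minimal sketch under that assumption; the helper name `fill_then_cast` is hypothetical and is not part of this PR.

```python
import numpy as np


def fill_then_cast(array, fill_value, dtype=None):
    """Hypothetical helper: fill NaN points, then cast the plain ndarray."""
    mask = np.isnan(array)
    if mask.any():
        # Work on a copy so the caller's array is left alone in this sketch.
        array = array.copy()
        array[mask] = fill_value
    # The cast happens last, on an ordinary (unmasked) ndarray.
    if dtype is not None and np.dtype(dtype) != array.dtype:
        array = array.astype(dtype)
    return array
```

Casting before filling would also try to convert the NaN points themselves, which is not meaningful for an integer target dtype.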

pp-mo · 2017-03-09T11:31:44Z

lib/iris/_lazy_data.py

+        if np.any(mask):
+            if filled:
+                if fill_value is None:
+                    emsg = 'Invalid fill value, got {!r}.'


As-is there's not much point in

if fill_value is None: 'Invalid fill value, got {!r}.'

It might as well say the exact truth.
E.G. "Dask result contains missing data, but no fill value was provided."

pp-mo · 2017-03-09T11:58:01Z

lib/iris/cube.py

+                self._numpy_array = array_nans_to_masked(data,
+                                                         self.fill_value,
+                                                         self.dtype)
+                self.dtype = None


hopefully they align in logic

Dead right, mine is spookily similar !

pp-mo · 2017-03-09T11:58:41Z

lib/iris/_lazy_data.py

+            else:
+                array = ma.masked_array(array, mask=mask,
+                                        fill_value=fill_value)
+    return array


Again, mine is very very similar here.
Seems like a case of "furious agreement"

pp-mo · 2017-03-09T12:34:24Z

lib/iris/_lazy_data.py

    return array
+
+
+def array_nans_to_masked(array, fill_value=None, dtype=None, filled=False):


Name probably needs fixing ?

pp-mo · 2017-03-09T12:36:43Z

lib/iris/_lazy_data.py

+
+def array_nans_to_masked(array, fill_value=None, dtype=None, filled=False):
+    """
+    Convert an array into a masked array, by masking any NaN points.


Needs fix: it does not always return a masked array.
(nor should it, IMHO)

pp-mo · 2017-03-09T12:48:20Z

I think there are three useful usecases

if any missing points then return a masked array (optional : can we set a fill_value ?, do I care ?)
fill any missing points with a given fill_value
raise error if there are any missing points

(+ in future we might consider a "don't even check for NaNs" option for performance-only-reasons)

pp-mo · 2017-03-09T14:22:56Z

Current + possible future call signatures for the above usecases
If array_nans_to_masked() = a-n-t-m() ...

"mask-nans" = "if any missing points then return a masked array"
"fill-nans" = "fill any missing points with a given fill_value"
"no-nans" = "raise error if there are any missing points"

current calls

mask-nans : a-n-t-m() [, fill_value=fv]
fill-nans : a-n-t-m(filled=True, fill_value=fv)
no-nans : a-n-t-m(filled=True [, fill_value=None])

future ideas ?
How about a-n-t-m(nans='mask', fill_value=None)

mask-nans : a-n-t-m() [, nans='mask']
fill-nans : a-n-t-m(nans='fill', fill_value=fv)
no-nans : a-n-t-m(nans=None)

The above still allows you to specify the fill_value of a masked array it creates.
If that isn't needed then a single key can do it all.
How about...
a-n-t-m(nans=None)

no-nans : a-n-t-m()
mask-nans : a-n-t-m(nans=np.ma.masked)
fill-nans : a-n-t-m(nans=fill_value)
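
The single-keyword signature listed last can be sketched roughly as below. This is only an illustration that assumes a floating-point ndarray input; the function name `convert_nans` is hypothetical, and the routine in this PR may differ in its details.

```python
import numpy as np
import numpy.ma as ma


def convert_nans(array, nans=None):
    """Hypothetical sketch of the one-keyword call signature."""
    mask = np.isnan(array)
    if not mask.any():
        # Nothing is missing, so pass the input straight through.
        return array
    if nans is None:
        # "no-nans": missing points are treated as an error.
        raise ValueError('Array contains missing data, but no '
                         'replacement was provided.')
    if nans is ma.masked:
        # "mask-nans": wrap in a masked array with the default fill_value.
        return ma.masked_array(array, mask=mask)
    # "fill-nans": overwrite the missing points in-place.
    array[mask] = nans
    return array
```

With this shape, `convert_nans(data)` raises on NaNs, `convert_nans(data, nans=ma.masked)` returns a masked array, and `convert_nans(data, nans=-999.0)` fills the input and returns that same object.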

bjlittle · 2017-03-09T14:23:17Z

@pp-mo

if any missing points then return a masked array (optional : can we set a fill_value ?, do I care ?)

Agreed use case, interesting part is whether we care to set the fill_value or just opt to use the default fill_value of the dtype as decreed by numpy. We still need the cube.fill_value as that tells the savers what to do, specific to their fileformat - and anyways the user can change that if they care.

If a user plucks a concrete masked data payload from the cube, for whatever reason, then they can set the fill_value to be what they care about, or to the cube.fill_value if that matters to them.

I guess I'm quickly coming to the conclusion that I can't see a use case for why we would want to set the fill_value on the masked data of a cube .... am I missing something obvious here?

fill any missing points with a given fill_value

Yup ... and fill them with the fill_value that is specified ... and this can only be non-None

raise error if there are any missing points

@pp-mo I'm struggling to see when this might be required ... are you thinking of points and bounds of coordinates here? Can you elaborate?

(+ in future we might consider a "don't even check for NaNs" option for performance-only-reasons)

Yup, but not now, right? That's an optimisation that I'd love for us to be in that "happy place" to make, but not just right now.

pp-mo · 2017-03-09T14:25:31Z

raise error if there are any missing points

@pp-mo I'm struggling to see when this might be required ... are you thinking of points and bounds of coordinates here? Can you elaborate?

Yes, that's exactly what I'm thinking of.

bjlittle · 2017-03-16T07:04:13Z

Ping @pp-mo @lbdreyer @dkillick 😄

bjlittle · 2017-03-16T08:56:35Z

lib/iris/analysis/trajectory.py

            # masked result, but it ensures we use a "filled" version of the
            # input in this case.
+            if cube.fill_value is not None:
+                source_data.fill_value = cube.fill_value


The way trajectory handles masks is pretty ropey ... this change appears appropriate.

The preceding inline code comment says it all really ...

bjlittle · 2017-03-16T11:13:20Z

@pp-mo @lbdreyer @dkillick someone should own this PR and drive the review forward ... tick, tock goes the clock.

bjlittle · 2017-03-16T11:27:56Z

lib/iris/_lazy_data.py

+                # Check the fill value is appropriate for the
+                # target result dtype.
+                try:
+                    [fill_value] = np.asarray([nans], dtype=result_dtype)


This should use the dtype of the array not the result_dtype ...

bjlittle · 2017-03-16T11:36:57Z

lib/iris/_lazy_data.py

+                    [fill_value] = np.asarray([nans], dtype=result_dtype)
+                except OverflowError:
+                    emsg = 'Fill value of {!r} invalid for result {!r}.'
+                    raise ValueError(emsg.format(nans, result_dtype))


Again, this should use array.dtype not result_dtype
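
A fill value can also be checked against the dtype of the array with explicit range bounds. Note that this sketch swaps in `np.iinfo`/`np.finfo` limits instead of relying on `np.asarray` raising `OverflowError` as the code above does, since whether an out-of-range value raises can differ between NumPy versions. The helper name `check_fill_value` is hypothetical.

```python
import numpy as np


def check_fill_value(fill_value, dtype):
    """Hypothetical helper: return fill_value cast to dtype, or raise."""
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.integer):
        info = np.iinfo(dtype)
        lower, upper = int(info.min), int(info.max)
    elif np.issubdtype(dtype, np.floating):
        info = np.finfo(dtype)
        lower, upper = float(info.min), float(info.max)
    else:
        raise TypeError('Unsupported dtype {!r}.'.format(dtype))
    if not (lower <= fill_value <= upper):
        emsg = 'Fill value of {!r} invalid for array {!r}.'
        raise ValueError(emsg.format(fill_value, dtype))
    # Cast to the scalar type of the array dtype.
    return dtype.type(fill_value)
```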

bjlittle · 2017-03-16T12:06:49Z

Downloads to travis seem to be getting throttled at the moment ... which is painful 😱

lbdreyer · 2017-03-16T11:11:37Z

lib/iris/_lazy_data.py

+
+    * nans:
+        If `nans` is None, then raise an exception if the `array` contains
+        any NaN values (default).


nans is None is never used so why did you include?

It's the default in the function formal parameter list.

Also, it's a very useful value to default to because it's gonna raise an exception if it detects that there are NaNs in your data. We're forcing people to care about this. So for example, this is good behaviour to adopt as it will force the issue in cases when streamed saving is being attempted with NaN data, but the fill_value isn't set i.e. None. See here in particular.

nans is None is never used

But it is here ?

But it is here ?

I meant when convert_nans_array is used. but I forgot that the fill_value of a cube could be set to None as @bjlittle's comment points out

lbdreyer · 2017-03-16T11:17:46Z

lib/iris/_lazy_data.py

+
+    * nans:
+        If `nans` is None, then raise an exception if the `array` contains
+        any NaN values (default).


I don't think nans is descriptive of the purpose of this keyword argument. Perhaps nans_replacement would be better?

My take on it is, what are my NaNs going to be replaced with

nans=fill_value is replace NaNs with the fill_value

nans=ma.masked is replace my NaNs with a mask

nans=None is I don't care, there shouldn't be any NaNs in my data, barf if there are

I'm open to a group take on this, as I've been staring at it way too long ... but I'm kinda wedded to the short and sweet nans name ... thoughts?

Okay, that's two votes for change ... that's enough for me ... nans_replacement it is!

nans_substitute is one character shorter if you prefer that

Thanks. First impressions last and nans_replacement is perfect.

lbdreyer · 2017-03-16T11:51:31Z

lib/iris/_lazy_data.py

+                raise ValueError(emsg)
+            elif nans is ma.masked:
+                # Mask the array with the default fill_value.
+                array = ma.masked_array(array, mask=mask)


So, this uses the default fill_value irrespective of what the fill value on the cube is.

Exactly. We no longer care. The fill_value that we do care about lives on the cube.
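
As a small illustration of the point above (not code from this PR): when no `fill_value` is passed, the masked array picks up NumPy's default for its dtype, which can be compared against `numpy.ma.default_fill_value`.

```python
import numpy as np
import numpy.ma as ma

array = np.array([[1.0, np.nan],
                  [3.0, 4.0]])

# Mask the NaN points without specifying a fill_value.
masked = ma.masked_array(array, mask=np.isnan(array))

# NumPy's default fill_value for this dtype.
default = ma.default_fill_value(array.dtype)
```

Here `masked.fill_value` matches `default`, whatever fill value the owning cube happens to carry.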

lbdreyer · 2017-03-16T11:52:34Z

lib/iris/tests/unit/lazy_data/test_convert_nans_array.py

+        self.assertEqual(result.fill_value, expected.fill_value)
+
+    def test_nans_filled(self):
+        fill_value = 666.0


Muh-ha-ha-har!

pp-mo · 2017-03-16T12:44:55Z

someone should own this PR and drive the review forward

I'm looking at it now, but I don't claim to be in charge.
I've only got 15 mins now, then busy 1300-1400.

pp-mo · 2017-03-16T12:45:46Z

Re:

I don't think nans is descriptive of the purpose of this keyword argument

I think I'm happy with what the routine does, but I will support better naming of arguments !

bjlittle · 2017-03-16T14:15:03Z

@pp-mo @lbdreyer @dkillick Okay guys are there any further objections to this PR?

I've addressed all of @lbdreyer's comments ...

pp-mo

I'm fine with the changes in existing code.

There are a few awkward corners in the behaviour of the core routine + its testing, which I think at least deserve documenting.

pp-mo · 2017-03-16T15:26:15Z

lib/iris/_lazy_data.py

+        If `nans_replacement` is None, then raise an exception if the `array`
+        contains any NaN values (default behaviour).
+        If `nans_replacement` is `numpy.ma.masked`, then convert the `array`
+        to a :class:`~numpy.ma.core.MaskedArray`.


NOTE: something missing here.
We should explain that in "masked array" mode the input array is copied, but in "fill" or "no-nans" modes, the result is the input array, which may be modified in-place.

may be modified in-place

P.S. except when we do a change of dtype.
A bit slippery that, isn't it 😬 ??
Perhaps should just say "in some cases, the input is modified in-place" + leave it at that ?

pp-mo · 2017-03-16T15:28:20Z

lib/iris/tests/unit/lazy_data/test_convert_nans_array.py

+        self.assertIs(result, self.array)
+        expected = np.array([[1.0, fill_value],
+                             [3.0, 4.0]])
+        self.assertArrayEqual(result, expected)


Should also check that result is the input (no copy, fill-in-place)

@pp-mo Well, that's already done on line 80 ... 👓

pp-mo · 2017-03-16T15:30:27Z

lib/iris/tests/unit/lazy_data/test_convert_nans_array.py

+        expected = np.array([[1, fill_value],
+                             [3, 4]],
+                            dtype=dtype)
+        self.assertArrayEqual(result, expected)


In this case, result is not the original (not in-place with new dtype).

P.S. did you know you can assign array.dtype ? It acts like a C pointer cast 😱 .
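
The difference between reinterpreting the bytes and converting the values can be seen with `ndarray.view`, which performs the same kind of reinterpretation without assigning to the `dtype` attribute. This is a generic NumPy illustration, not code from this PR.

```python
import numpy as np

a = np.array([1.0], dtype=np.float32)

# Reinterpret the same 4 bytes as an integer (like a C pointer cast).
reinterpreted = a.view(np.int32)

# Convert the value, producing a new array with new contents.
converted = a.astype(np.int32)
```

`reinterpreted` holds the IEEE 754 bit pattern of 1.0 (0x3F800000) and shares memory with `a`, whereas `converted` holds the value 1 in a separate array.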

pp-mo · 2017-03-16T15:32:46Z

lib/iris/tests/unit/lazy_data/test_convert_nans_array.py

+        self.assertIsInstance(result, ma.MaskedArray)
+        self.assertIs(result, array)
+
+    def test_pass_thru_masked_array_integer(self):


Unmasked ints are pass-through anyway.
Should be checking that.
It's in the docstring!

@pp-mo Yup, good spot. Done!

pp-mo · 2017-03-16T15:36:36Z

lib/iris/_lazy_data.py

+        fill value.
+
+    * result_dtype:
+        Cast the resultant array to this target :class:`~numpy.dtype`.


This will not happen with pass-through cases (anything already masked, or integer types).
We should explain that here, somehow.

@pp-mo This is actually mentioned in the doc-string note. I think that is sufficient.

bjlittle · 2017-03-20T13:44:51Z

lib/iris/_lazy_data.py

+                result = array.data.copy()
+            result[mask] = np.nan
+        else:
+            result = array.data


@lbdreyer This addresses the following numpy warning:

MaskedArrayFutureWarning: setting an item on a masked array which has a shared mask will not copy the mask and also change the original mask array in the future. Check the NumPy 1.11 release notes for more information.

Which was generated by array[mask] = np.nan whenever the data was masked, but the dtype was not integral.
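
The copy-before-assign pattern in the diff above can be sketched in isolation as follows, assuming a floating-point masked array. The helper name `masked_to_nans` is hypothetical; only the branch structure mirrors the diff.

```python
import numpy as np
import numpy.ma as ma


def masked_to_nans(array):
    """Hypothetical helper: float masked array -> ndarray holding NaNs."""
    if not ma.isMaskedArray(array):
        return array
    mask = ma.getmaskarray(array)
    if mask.any():
        # Copy the payload first, so the original masked array (and any
        # mask it shares) is left untouched.
        result = array.data.copy()
        result[mask] = np.nan
    else:
        result = array.data
    return result
```

Because the assignment targets a plain ndarray copy rather than the masked array itself, no item is ever set on an array with a shared mask.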

bjlittle · 2017-03-20T13:53:00Z

closes #2347

lbdreyer · 2017-03-20T14:24:26Z

lib/iris/tests/results/netcdf/netcdf_monotonic.cml

 <?xml version="1.0" ?>
 <cubes xmlns="urn:x-iris:cubeml-0.2">
-  <cube standard_name="eastward_wind" units="m s-1" var_name="wind1">
+  <cube core-dtype="int32" dtype="int32" standard_name="eastward_wind" units="m s-1" var_name="wind1">


At some point we need to take out these core-dtype from the cml.
Perhaps at the end of this sprint (considering we want to target fill_value/dtype/get-tests-working this sprint)

lbdreyer · 2017-03-20T14:26:47Z

All comments have now been addressed so I'm gonna merge this in

bjlittle · 2017-03-20T15:26:31Z

@lbdreyer Awesome, thanks!

* NetCDF save with dask support
* Refactor and use array_nans_to_masked
* Add array_nans_to_masked test coverage.
* Refactor.
* Working tests.
* Update comment detail for convert usage.
* Use array.dtype in convert_nans_array
* Review actions.
* Review actions.

bjlittle added the Status: Work in Progress label Feb 27, 2017

bjlittle added this to the dask milestone Feb 27, 2017

bjlittle added the dask label Feb 27, 2017

pp-mo closed this Mar 8, 2017

pp-mo removed the Status: Work in Progress label Mar 8, 2017

bjlittle changed the base branch from dask_timed to dask March 8, 2017 12:05

bjlittle reopened this Mar 8, 2017

bjlittle added the Status: Work in Progress label Mar 8, 2017

bjlittle commented Mar 9, 2017

pp-mo self-requested a review March 9, 2017 12:21

pp-mo reviewed Mar 9, 2017

bjlittle added 4 commits March 15, 2017 15:58

NetCDF save with dask support

f4ce322

Refactor and use array_nans_to_masked

80af4ae

Add array_nans_to_masked test coverage.

fa5bff2

Refactor.

693ac38

bjlittle force-pushed the dask-netcdf-save branch 3 times, most recently from 237e124 to 6d2f381 Compare March 16, 2017 02:39

Working tests.

570f2ca

bjlittle force-pushed the dask-netcdf-save branch from 6d2f381 to 570f2ca Compare March 16, 2017 02:48

bjlittle commented Mar 16, 2017

bjlittle requested review from DPeterK and lbdreyer March 16, 2017 11:13

bjlittle commented Mar 16, 2017

Use array.dtype in convert_nans_array

b3ac9fc

lbdreyer requested changes Mar 16, 2017

Review actions.

4f062a7

pp-mo requested changes Mar 16, 2017

lbdreyer approved these changes Mar 16, 2017

pp-mo mentioned this pull request Mar 16, 2017

Replace dask "computes" with common call #2422

Closed

Review actions.

6400ab2

bjlittle commented Mar 20, 2017

bjlittle assigned lbdreyer Mar 20, 2017

lbdreyer reviewed Mar 20, 2017

lbdreyer merged commit 8e3ede4 into SciTools:dask Mar 20, 2017

QuLogic removed the Status: Work in Progress label Mar 20, 2017

bjlittle deleted the dask-netcdf-save branch March 20, 2017 15:27

QuLogic modified the milestones: dask, v2.0 Aug 2, 2017
